Candidate word generation for OCR errors using optimization algorithm

Abstract

OCR post-processing is an important step to improve text accuracy. It includes two main tasks: error detection and error correction. The hill climbing algorithm is a heuristic search method used for solving optimization problems. In this paper, we present a novel correction approach using an adapted version of the algorithm. Correction candidates for OCR errors are explored by random character edits and evolved with hill climbing. The edit patterns are obtained from training data. The proposed model is evaluated on the benchmark dataset of the post-correction competition of the International Conference on Document Analysis and Recognition 2017. It is shown that our model outperforms various baseline approaches from the competition. In addition, the randomness of the model is analyzed to verify its stability under different parameter configurations.
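The idea of exploring candidates through random character edits guided by hill climbing can be illustrated with a minimal sketch. It is not the paper's implementation: the edit patterns, the lexicon, the fitness function, and all function names below are hypothetical stand-ins chosen for illustration, whereas the paper derives its edit patterns from training data.

```python
import random

# Hypothetical edit patterns (OCR fragment -> replacement). The paper learns
# such patterns from training data; these are invented for illustration.
EDIT_PATTERNS = [("1", "l"), ("0", "o"), ("rn", "m"), ("vv", "w"), ("c", "e")]

# Hypothetical tiny lexicon used by the toy fitness function.
LEXICON = {"hello", "letter", "modern", "window"}


def edit_distance(a, b):
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def fitness(word):
    """Toy score: negative distance to the closest lexicon word (0 is best)."""
    return -min(edit_distance(word, w) for w in LEXICON)


def random_edit(word, rng):
    """Apply one randomly chosen applicable pattern at a random position."""
    options = [(i, src, dst)
               for src, dst in EDIT_PATTERNS
               for i in range(len(word))
               if word.startswith(src, i)]
    if not options:
        return word  # no pattern applies, so the word is left unchanged
    i, src, dst = rng.choice(options)
    return word[:i] + dst + word[i + len(src):]


def hill_climb(error, steps=50, seed=0):
    """Move to a random neighbour only when it strictly improves the fitness.

    Returns every word visited, paired with its score, best first.
    """
    rng = random.Random(seed)
    current, best = error, fitness(error)
    candidates = {error: best}
    for _ in range(steps):
        neighbour = random_edit(current, rng)
        score = fitness(neighbour)
        candidates[neighbour] = score
        if score > best:
            current, best = neighbour, score
    return sorted(candidates.items(), key=lambda kv: -kv[1])
```

Calling `hill_climb("he11o")` returns a ranked candidate list whose first entry is `("hello", 0)`. A realistic fitness function would more likely combine a language model score with the likelihood of the applied edits than rely on a fixed lexicon.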

Similar resources

Optimization of grid independent diesel-based hybrid system for power generation using improved particle swarm optimization algorithm

The power supply of remote sites and applications at minimal cost and with low emissions is an important issue when discussing future energy concepts. This paper presents modeling and optimization of a photovoltaic (PV)/wind/diesel system with batteries storage for electrification to an off-grid remote area located in Rafsanjan, Iran. For this location, different hybrid systems are studied and ...

Text Deblurring Using OCR Word Confidence

The objective of this paper is to propose a new deblurring method for motion-blurred textual images. The technique estimates the blur kernel, or point spread function, of the motion blur using blind deconvolution. Motion blur is caused by movement of either the camera or the object at the time of image capture. The point spread function of the motion blur is governed by two ...

Strategies for Reducing and Correcting OCR Errors

In this paper we describe our efforts in reducing and correcting OCR errors in the context of building a large multilingual heritage corpus of Alpine texts which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995 with more than 75,000 pages resulting in 29 million running words. Sinc...

Hyperdocument Generation using OCR and Icon Detection

In this contribution we consider the construction of hyperdocuments; converting scanned paper documents into electronic hypertext. Hyperlink creation is automated by analyzing the structure and content of the scanned document. The focus is on hyperlinks between the text and labels in a picture. A number of tools for such hyperlink detection are described. Practical results are presented.

Word Segmentation for Urdu OCR System

This paper presents a word segmentation technique for an Urdu OCR system. Word segmentation, or word tokenization, is a preliminary task for understanding the meaning of sentences in Urdu language processing. Several techniques are available for word segmentation in other languages, but little work has been done on word segmentation for Urdu Optical Character Recognition (OCR) systems. A me...

Journal

Journal title: Nucleation and Atmospheric Aerosols

Year: 2021

ISSN: 0094-243X, 1551-7616, 1935-0465

DOI: https://doi.org/10.1063/5.0066687